Pandas Exploratory Data Analysis
Data analysis is the process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information and informing conclusions to support decision-making. This involves applying inferential statistical analyses and creating visualizations in order to interpret the results and summarize the main characteristics of the data. The typical information and conclusions extracted from the data include interactions, patterns, and anomalies. These notes rely on the ideas and learnings from the respective package documentation, "Python For Data Analysis: Data Wrangling With Pandas, NumPy, And Jupyter", 3rd Edition, by Wes McKinney (creator and developer of Pandas) in 2022, and "Python Data Science Handbook: Essential Tools For Working With Data", 2nd Edition, by Jake VanderPlas in 2022.
When using Python for data analysis and data science, the most common packages and libraries include NumPy as the numeric library underlying all of the calculations; Pandas as the cornerstone of data manipulation; Matplotlib, Seaborn, and Plotly for intricate visualizations; Statsmodels for advanced statistical functions; SciPy for advanced scientific computing; Scikit-Learn as a toolkit for machine learning; and TensorFlow and PyTorch for artificial intelligence applications. For convenience, Anaconda can be used as a distribution of Python with pre-installed packages which focus on data analysis and data science. In many cases, the vast and open network of packages and libraries available for Python can be leveraged depending on the requirements of projects. It should be kept in mind that the use of Python diverges from traditional tools used for data analysis, such as Microsoft Excel and Tableau, which are primarily visual through point-and-click interfaces and become difficult to use when processing very large sets of data.
The process of data analysis generally follows data extraction, data cleansing, data wrangling, analysis, and resultant action (although the process usually resembles a repetitive cycle rather than a linear path). The data extraction is associated with sourcing data from local or online databases stored in SQL, CSV, XML, JSON, or another file format. The data cleansing is associated with accounting for missing values, empty sets, invalid fields, and other errors. The data wrangling is associated with merging, combining, and joining data to re-arrange and re-shape the data into categories, hierarchies, or indices. The analysis is associated with exploration to extract results from the data, which involves statistical analyses to identify the underlying trends and characteristics. The resultant action is associated with the subsequent recommendations which result from the knowledge gained from the overall process.
Installation And Setup
Pandas (short for "Panel Data" and associated with "Python Data Analysis") provides high-level functionality designed to make using structured data convenient and flexible. This functionality is built on that of NumPy, ..., and ..., and allows for capabilities ranging from intuitive indexing to flexible data manipulation. The vast applicability of Pandas provides for capabilities for reading and writing a variety of file formats and data stores; cleaning, munging, combining, normalizing, reshaping, slicing, and transforming data; applying mathematical and statistical operations and transformations to groups of data to derive new sets of data; connecting data to statistical models, machine learning algorithms, and other computational tools; and creating static or interactive graphical visualizations or textual summaries for presentation.
The prerequisites to install Pandas are NumPy, Dateutil, and Pytz (with more optional dependencies for performance, visualization, computation, and other data sources). Pandas can usually be installed through a package manager, as conventionally performed using Pip, or, alternatively, through the native package manager of a Linux distribution (although this version may be outdated or may not be officially maintained). For advanced developers, Pandas can be built and installed from its source code with control over options for compiling. Once installed, Pandas can be imported into a project.
pip install pandas
pip install --upgrade numpy
conda install pandas
conda update pandas
import pandas
import pandas as pd
...
pandas.options.display.max_rows = 20
pandas.options.display.max_columns = 20
pandas.options.display.max_colwidth = 80
...
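As a brief sketch of the display options above (the option names are the standard display options, and the chosen values are illustrative), the same settings can also be made and queried through the set_option and get_option functions:

```python
import pandas as pd

# Set display options through the function interface (equivalent to
# assigning pandas.options.display.max_rows and the related attributes).
pd.set_option("display.max_rows", 20)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 80)

# Query the current values back through either interface.
print(pd.get_option("display.max_rows"))
print(pd.options.display.max_colwidth)
```

Both interfaces modify the same underlying options, so mixing them is safe.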
... Creation
With regard to structured data, the most common forms include tabular data (in which each column may be a different type, but each value in a column is the same type), multi-dimensional arrays, multiple tables interrelated by key columns, and evenly or unevenly spaced time series data. In Pandas, structured data can be represented through a series, as a homogeneous and 1-dimensional array with labels for an index, or data frame, as a heterogeneous and column-oriented table with labels for rows and columns (although 2-dimensional, it is possible to represent higher dimensional data in a tabular format using hierarchical indexing). From a high level, a data frame is a programming interface for expressing data manipulations on tabular datasets in a general programming language and whose primary modality is analytical. Compared to SQL-based systems, data frames often use imperative or procedural constructs (emphasis on iterations with a sequence of operations for manipulation), offer access to internal structures, expose operations outside of traditional relational algebra (taking advantage of ordering of records within datasets), and have stateful semantics (...).
A series can be created from a list, dictionary, or array with the index being optionally defined to identify each value with a label. For an alternative perspective, a series can be thought of as a fixed-length and ordered dictionary, as it is a direct mapping of data and index values. A data frame can be created from a list, dictionary, or array with the index and columns being optionally defined to identify each value with labels (using a nested dictionary of dictionaries, the keys of the outer dictionary will be used for the columns and keys of the inner dictionary will be used for the rows). For an alternative perspective, a data frame can be thought of as a dictionary of series which share the same index. It should also be noted that the index of a series or data frame behaves like a fixed-size set (with allowance for duplicate values), where there are also specialized types, such as for monotonic integers, intervals, time, or multi-level objects.
pandas.Series (data = None, index = None, dtype = None, name = None, copy = None, fastpath = False)
pandas.DataFrame (data = None, index = None, columns = None, dtype = None, copy = None)
pandas.Index (data = None, dtype = None, copy = False, name = None, tupleize_cols = True)
pandas.RangeIndex (start = None, stop = None, step = None, dtype = None, copy = False, name = None)
pandas.IntervalIndex (data, closed = None, dtype = None, copy = False, name = None, verify_integrity = True)
pandas.DatetimeIndex (data = None, freq = _NoDefault.no_default, tz = _NoDefault.no_default, normalize = False, closed = None, ambiguous = "raise", dayfirst = False, yearfirst = False, dtype = None, copy = False, name = None)
pandas.TimedeltaIndex (data = None, unit = None, freq = _NoDefault.no_default, closed = None, dtype = None, copy = False, name = None)
pandas.PeriodIndex (data = None, ordinal = None, freq = None, dtype = None, copy = False, name = None, **fields)
pandas.MultiIndex (levels = None, codes = None, sortorder = None, names = None, dtype = None, copy = False, name = None, verify_integrity = True)
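The creation of a series and a data frame described above can be sketched as follows (the labels and values are illustrative):

```python
import pandas as pd

# A series from a dictionary: the keys become the index labels,
# behaving like a fixed-length and ordered dictionary.
s = pd.Series({"a": 1, "b": 2, "c": 3}, name="counts")

# A data frame from a nested dictionary of dictionaries: the keys of
# the outer dictionary become the columns and the keys of the inner
# dictionaries become the row index.
df = pd.DataFrame({"x": {"r1": 1.0, "r2": 2.0},
                   "y": {"r1": 3.0, "r2": 4.0}})

print(list(s.index))     # the series index labels
print(list(df.columns))  # the outer keys
print(list(df.index))    # the inner keys
```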
The information and properties intrinsic to a series, data frame, or index are reflected by the attributes of the series, data frame, or index. For a series, the common attributes include the underlying index, underlying array of data, shape as the size along each dimension, and name assigned to the series. For a data frame, the common attributes include the underlying index for the rows and columns, underlying array of data, and shape as the size along each dimension. For an index, the primary attributes include the labels of the axis and other metadata, such as the shape as the size along each dimension and name assigned to the index (it should be noted that these objects are immutable and cannot be directly modified).
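These attributes, and the immutability of index objects, can be sketched as follows (the labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="demo")
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Common attributes: index, shape, and name.
print(s.shape)   # size along each dimension of the series
print(s.name)    # name assigned to the series
print(df.shape)  # rows and columns of the data frame

# Index objects are immutable: assigning to a label raises TypeError.
try:
    s.index[0] = "z"
except TypeError:
    print("index labels cannot be modified in place")
```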
A series can be indexed to create a slice by a list or tuple of integers (positional indexing), booleans (logical indexing), or labels associated with the index (dictionary-like notation). A data frame can also be indexed to create a slice by a list or tuple of integers (positional indexing), booleans (logical indexing), or labels associated with the index and columns (dictionary-like notation). It is also possible to use dot notation to access a column of a data frame (attribute-like notation, which requires the column label to be a valid variable name and cannot be used to create a new column). However, the preferred way of indexing is to use the supplied methods for positional, logical, or label indexing, as this allows for more consistent behaviour regardless of the data type used and avoids ambiguity. The result of selecting a single row or column in a data frame is a series with an index which contains the column or row labels. Methods are also available to select single scalar values directly within a series or data frame for improved performance with less overhead.
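The preferred indexing methods can be sketched as follows (loc and iloc for label and positional indexing, and at and iat for scalar access; the labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]},
                  index=["a", "b", "c"])

# Label indexing with loc and positional indexing with iloc; selecting
# a single row yields a series indexed by the column labels.
row_by_label = df.loc["b"]
row_by_position = df.iloc[1]
print(row_by_label.equals(row_by_position))

# Logical (boolean) indexing selects the rows where the mask is True.
mask = df["x"] > 1
print(list(df.loc[mask].index))

# Scalar access with less overhead through at (label) and iat (position).
print(df.at["c", "y"], df.iat[2, 1])
```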
A distinction needs to be made between basic indexing and advanced indexing. The primary difference is that basic indexing will only select a slice from an array, while advanced indexing will select an arbitrary group from an array (which allows for repetition of indices). Under basic indexing, a slice of the original array is referenced, where this slice is a view (using the same values in memory) and any modification to the view will be reflected in the original array (a copy needs to be explicitly specified to create a new object). Under advanced indexing, a group from the original array is created, where this group is a copy and acts as a new object. It should be noted that selecting data by boolean indexing and assigning the result will always create a copy of the data. In addition, the search order for indexing is row-major (fill the consecutive elements of a row before moving to subsequent rows).
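The distinction between a view (basic indexing) and a copy (advanced indexing) can be sketched with NumPy, whose semantics underlie the arrays in Pandas (the values are illustrative):

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)  # row-major layout by default

# Basic indexing: a slice is a view, so modifying the slice
# modifies the original array as well.
view = arr[0, :]
view[0] = 99
print(arr[0, 0])  # the original reflects the change

# Advanced (fancy) indexing: an arbitrary group is a copy, and
# repetition of indices is allowed.
group = arr[[0, 0, 1], [0, 1, 2]]
group[0] = -1
print(arr[0, 0])  # the original is unchanged by modifying the copy
```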
Basic Data Manipulation
Although there are several general functions, most of the manipulation of data is performed through methods. For data alignment, the index of a series or data frame can be modified to re-order existing data and fill locations without values (with a NaN by default). In a similar way, it is possible to set the index using a column of a data frame or reset the index of a series or data frame. Additional values can be inserted into or appended onto an array, while other values can be deleted using the appropriate indices. In addition, it should be noted that a series or data frame often behaves similarly to a ndarray and can often be used as an input to universal functions from NumPy (or converted into an array, where the data type will be chosen to accommodate all of the columns).
series.reindex (index = None, *, axis = None, method = None, copy = None, level = None, fill_value = None, limit = None, tolerance = None)
data_frame.reindex (labels = None, *, index = None, columns = None, axis = None, method = None, copy = None, level = None, fill_value = nan, limit = None, tolerance = None)
series.align (other, join = "outer", axis = None, level = None, copy = None, fill_value = None, method = None, limit = None, fill_axis = 0, broadcast_axis = None)
data_frame.align (other, join = "outer", axis = None, level = None, copy = None, fill_value = None, method = None, limit = None, fill_axis = 0, broadcast_axis = None)
series.rename (index = None, *, axis = None, copy = None, inplace = False, level = None, errors = "ignore")
data_frame.rename (mapper = None, *, index = None, columns = None, axis = None, copy = None, inplace = False, level = None, errors = "ignore")
series.set_axis (labels, *, axis = 0, copy = None)
data_frame.set_axis (labels, *, axis = 0, copy = None)
series.reset_index (level = None, *, drop = False, name = _NoDefault.no_default, inplace = False, allow_duplicates = False)
data_frame.reset_index(level = None, *, drop = False, inplace = False, col_level = 0, col_fill = "", allow_duplicates = _NoDefault.no_default, names = None)
series.drop (labels = None, *, axis = 0, index = None, columns = None, level = None, inplace = False, errors = "raise")
data_frame.drop (labels = None, *, axis = 0, index = None, columns = None, level = None, inplace = False, errors = "raise")
pandas.Index.drop (labels, errors = "raise")
pandas.Index.delete (loc)
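The alignment and index methods above can be sketched as follows (the labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Reindexing re-orders existing data and fills locations without
# values (NaN by default, or a chosen fill value).
r = s.reindex(["c", "a", "d"], fill_value=0.0)
print(list(r.index), r["d"])

df = pd.DataFrame({"key": ["k1", "k2"], "value": [10, 20]})

# Set a column of the data frame as the index, then reset the index
# back into a column.
indexed = df.set_index("key")
restored = indexed.reset_index()
print(list(restored.columns))

# Drop a row by label and a column by name.
dropped = df.drop(index=[0]).drop(columns=["value"])
print(list(dropped.columns))
```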
Calculations And Operations
When performing arithmetic operations, the corresponding values with a common index are added, subtracted, multiplied, or divided, where missing values (or a chosen fill value) will be introduced for the locations which do not have a common index. Thus, the results will have an index which is the union of the indices from each of the parts of the operation (using series or data frames which have no common index will result in only missing values). This is comparable to performing the operations in an element-wise manner as a batch operation. Similarly, logic functions can be used to evaluate an array with the results given as booleans in an element-wise or matrix-wise manner - common logic functions include evaluating whether values are greater than, less than, or equal to a basis. If the arrays are not the same shape, they must be broadcastable to a common shape along their dimensions. It should be noted that, due to this broadcasting, whenever an operation involves an array with a scalar, an element-wise operation will be performed, where the scalar is applied to each element of the array based on the operation.
data_frame_augend.add (data_frame_addend, axis = "columns", level = None, fill_value = None)
data_frame_minuend.sub (data_frame_subtrahend, axis = "columns", level = None, fill_value = None)
data_frame_multiplicand.mul (data_frame_multiplier, axis = "columns", level = None, fill_value = None)
data_frame_dividend.div (data_frame_divisor, axis = "columns", level = None, fill_value = None)
data_frame_dividend.mod (data_frame_divisor, axis = "columns", level = None, fill_value = None)
data_frame_base.pow (data_frame_exponent, axis = "columns", level = None, fill_value = None)
data_frame_addend.radd (data_frame_augend, axis = "columns", level = None, fill_value = None)
data_frame_subtrahend.rsub (data_frame_minuend, axis = "columns", level = None, fill_value = None)
data_frame_multiplier.rmul (data_frame_multiplicand, axis = "columns", level = None, fill_value = None)
data_frame_divisor.rdiv (data_frame_dividend, axis = "columns", level = None, fill_value = None)
data_frame_divisor.rmod (data_frame_dividend, axis = "columns", level = None, fill_value = None)
data_frame_exponent.rpow (data_frame_base, axis = "columns", level = None, fill_value = None)
data_frame_standard.gt (data_frame_basis, axis = "columns", level = None)
data_frame_standard.ge (data_frame_basis, axis = "columns", level = None)
data_frame_standard.lt (data_frame_basis, axis = "columns", level = None)
data_frame_standard.le (data_frame_basis, axis = "columns", level = None)
data_frame_standard.eq (data_frame_basis, axis = "columns", level = None)
data_frame_standard.ne (data_frame_basis, axis = "columns", level = None)
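The alignment behaviour of the arithmetic and comparison methods above can be sketched with series (the labels and values are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["a", "b", "c"])
b = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Plain addition aligns on the union of the indices; locations which
# do not have a common index become missing values (NaN).
total = a + b
print(list(total.index))
print(pd.isna(total["a"]), total["b"], pd.isna(total["d"]))

# The add method accepts a fill value for the one-sided locations.
filled = a.add(b, fill_value=0)
print(filled["a"], filled["d"])

# Comparison methods evaluate element-wise and return booleans.
print(a.gt(1).tolist())
```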
If there is not a built-in function, it is possible to create a function performing the desired operations and then apply this function along an axis of a data frame or as an element-wise operation. A distinction can also be made between transforming data (keeping a consistent structure) and aggregating data (producing a modified structure). However, in most cases, there will be a suitable built-in function to use for statistics (mean, median, standard deviation, etc), sorting (numerical, alphabetical, ascending, descending, etc), and sets (unique, ..., etc).
data_frame.apply (function_handle, axis = 0, raw = False, result_type = None, args = (), **kwargs)
data_frame.applymap (function_handle, na_action = None, **kwargs)
data_frame.transform (function_handle, axis = 0, *args, **kwargs)
data_frame.aggregate (function_handle = None, axis = 0, *args, **kwargs)
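The distinction between applying, transforming, and aggregating can be sketched as follows (the values and functions are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

# Apply a custom function along each column (axis = 0) to reduce
# each column to a single value.
spans = df.apply(lambda column: column.max() - column.min(), axis=0)
print(spans["x"], spans["y"])

# Transform keeps a consistent structure while modifying each value.
doubled = df.transform(lambda value: value * 2)
print(doubled.shape == df.shape)

# Aggregate produces a modified structure, possibly applying several
# built-in functions at once.
summary = df.aggregate(["min", "max"])
print(summary.at["max", "y"])
```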
... min max sum mean median
...
...
...
...
...